Summary

This work reports the most relevant aspects of the problem of learning the structure of
Markov networks from data. This problem is becoming increasingly important in machine
learning and in the many fields that apply it. Markov networks, together with Bayesian
networks, are probabilistic graphical models, a formalism widely used for handling
probability distributions in intelligent systems. Learning these models from data has been
extensively applied in the case of Bayesian networks, but not in the case of Markov
networks, because of its computational intractability. However, this situation is being
reversed, given the exponential growth of computing capacity, the large amount of digital
data available, and the research on new learning technologies. This work emphasizes a
technology called independence-based learning, which allows the independence structure of
such networks to be learned from data in an efficient and sound manner, provided that the
amount of available data is sufficient and that the data are a representative sample of the
underlying distribution. In analyzing this technology, this work reports the
state-of-the-art algorithms for learning the structure of Markov networks, discusses their
current limitations, and proposes a series of open problems where work is possible toward
advances in the area in terms of quality and efficiency. The paper concludes by opening a
discussion on how to develop a general formalism for improving the quality of the learned
structures when data are insufficient.
Abstract

This work reports the most relevant technical aspects of the problem of learning the
structure of Markov networks from data. This problem has become increasingly important
in machine learning and in many of its application fields. Markov networks, together
with Bayesian networks, are probabilistic graphical models, a widely used formalism for
handling probability distributions in intelligent systems. Learning graphical models from
data has been extensively applied in the case of Bayesian networks, but for Markov
networks learning is not tractable in practice. However, this situation is changing, given
the exponential growth of computing capacity, the plethora of available digital data, and
the research on new learning technologies. This work emphasizes a technology called
independence-based learning, which allows the independence structure of those networks to
be learned from data in an efficient and sound manner, provided that the dataset is
sufficiently large and the data are a representative sample of the target distribution. In
analyzing this technology, this work surveys the current state-of-the-art algorithms for
learning the structure of Markov networks, discusses their current limitations, and
proposes a series of open problems where future work may produce advances in the area in
terms of quality and efficiency. The paper concludes by opening a discussion on how to
develop a general formalism for improving the quality of the learned structures when data
are scarce.
1 Motivation

Nowadays intelligent systems have to reason in realistic domains, storing their knowledge
of the world and supporting efficient inference even when exceptions occur. This is known
in the literature as reasoning under uncertainty. A popular approach to reasoning under
uncertainty is the use of probabilistic models, a statistical tool for performing
statistical inference. Statistical inference draws conclusions from data by calculating
the probability of propositional sentences. An example representation of a probabilistic
model is the tabular probabilistic model, a function represented as a table that assigns a
probability to every possible complete assignment in a domain, such that the probabilities
add up to 1. Figure 1 illustrates an abstract tabular model for a domain with n binary
variables $V = \{X_0, \ldots, X_{n-1}\}$, consisting of $2^n$ tuples, one per possible
configuration of the variables.
Figure 1: An example tabular model over n binary random variables, with $2^n$ numerical
parameters.

However, the tabular model has computational and semantic limitations. First, its storage
requirements are exponential in the number of variables and in the size of their
respective domains. When the domains of the variables are continuous, such a table would
be infinite, and in practice mathematical functions are used instead. Nonetheless, this
work restricts its attention to discrete distributions, so continuous variables may be
considered as discretized variables. Second, interesting queries usually do not involve
all the variables, and computing marginal and conditional probabilities requires
summations over an exponential number of variable combinations. Third, such a
representation does not have clear semantics for humans. Human knowledge typically takes
the form of probabilistic judgments on a small number of propositions. Therefore,
conditional independences are a natural way of representing probability distributions. It
is common for people to judge a three-place relationship of conditional dependency, i.e.,
X influences Y, given Z.

Using independences may reduce the exponential requirements of the tabular model.
For example, just assuming that all the n variables in Figure 1 are mutually independent
allows decomposing the joint probability distribution as

$$\Pr(X_0, \ldots, X_{n-1}) = \prod_{i=0}^{n-1} \Pr(X_i).$$

Such a decomposition requires a polynomial number (n) of exponentially smaller tables with
only two rows each. Figure 2 illustrates a model assuming that all the binary variables are
mutually independent, consisting of only n tables with 2 tuples each.
Figure 2: An example model assuming that all the variables of the domain are mutually
independent, with n tables of only 2 numerical parameters each.
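
As a rough illustration of the saving, the following Python sketch compares the number of
parameters of both representations and rebuilds the full joint table from the per-variable
tables. The function names and the marginal values are hypothetical, chosen only for this
example.

```python
from itertools import product
from math import prod


def tabular_parameter_count(n):
    """Number of rows of a full joint table over n binary variables."""
    return 2 ** n


def independent_parameter_count(n):
    """Number of rows of n separate tables with 2 entries each."""
    return 2 * n


def joint_from_marginals(marginals):
    """Build the joint table from Pr(X_i = 1), assuming mutual independence."""
    joint = {}
    for assignment in product((0, 1), repeat=len(marginals)):
        # Product of the per-variable probabilities of the assigned values.
        joint[assignment] = prod(
            p if value == 1 else 1.0 - p
            for p, value in zip(marginals, assignment)
        )
    return joint


marginals = [0.2, 0.5, 0.9]  # hypothetical values of Pr(X_i = 1)
joint = joint_from_marginals(marginals)
```

For n = 10 the full table has 1024 rows, whereas the independent model needs only 20.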

To address all these problems, namely the exponential storage requirements, the
exponential cost of computing marginal and conditional probabilities, and the lack of
explicitness of the model, several researchers in the late 1980s created probabilistic
graphical models, or simply graphical models, a well-established formalism for compactly
representing joint probability distributions. They are composed of i) an independence
structure that encodes the independences present in the distribution, and ii) a set of
numerical parameters, given as a list of marginal probability distributions. This
representation is explained in more detail in Section 2.

The most important types of graphical models are Bayesian networks and Markov networks
(Pearl, 1988). The well-known Bayesian networks are graphical models for encoding
distributions whose dependencies are representable by a directed acyclic graph. Markov
networks (also known as Markov Random Fields, undirected graphical models, or simply
undirected models) encode distributions whose dependencies are representable by an
undirected graph. Three of the most influential textbooks on this topic published in the
last three decades are (Pearl, 1988), (Lauritzen, 1996), and (Koller and Friedman, 2009).

There is a long list of applications of graphical models in a wide range of fields in
recent years. Some examples come from computer vision and image analysis, such as
(Besag et al., 1991), which gives two examples, one in archeology and the other in
epidemiology; (Anguelov et al., 2005), which addresses the problem of segmenting 3D scan
data into objects or object classes; and (Li, 2001), a complete textbook that presents
applications of Markov Random Fields to image restoration and edge detection in the
low-level domain, and to object matching and recognition in the high-level domain. More
examples come from spatial data mining and geostatistics, such as those presented in the
textbook of Cressie (Cressie, 1992), where Markov Random Fields are emphasized for
modeling spatial lattice data; or, more recently, the work of Shekhar et al. (Shekhar et
al., 2004), which presents spatial analysis methods and applications of Markov Random
Fields in a wide range of fields, such as biology, spatial economics, environmental and
earth science, ecology, geography, epidemiology, agronomy, forestry, and mineral
prospecting. There are also several examples in disease diagnosis, such as (Schmidt et
al., 2008), which presents a method for detecting coronary heart disease by processing
ultrasound images of echocardiograms; and in computational biology, such as (Friedman et
al., 2000), which proposes the use of Bayesian networks for discovering interactions among
genes. Graphical models have also been applied to evolutionary optimization (Mühlenbein
and Paaß, 1996), as in (Larrañaga and Lozano, 2002), which describes the use of Bayesian
networks for modeling the probability distribution of individuals with high fitness in
evolutionary algorithms, or more recently (Alden, 2007; Shakya and Santana, 2008), which
propose Markov networks for the same purpose. Further examples come from Information
Retrieval (Metzler and Croft, 2005; Cai et al., 2007), where Markov Random Fields are
used for modeling term dependencies; and from malware propagation (Karyotis, 2010), where
Markov Random Fields are used for analyzing the spatial and contextual dependencies of the
propagation. Many other interesting examples could be part of this list.

The framework provided by probabilistic graphical models supports three critical
capabilities of intelligent systems, as highlighted in the textbook of Koller and Friedman:

i) Representation: a compact and declarative model of the knowledge based on graphs.
On one hand, such models are compact because they provide an efficient and
computationally tractable representation of the conditional independences present in a
probability distribution. This compactness is achieved by exploiting a property present
in many distributions: variables tend to interact directly with only a few others. On
the other hand, since they are graphical, they are declarative, and a human expert can
understand and evaluate their semantics and properties.

ii) Inference: given a graphical model, the most fundamental and yet highly non-trivial
task is to compute marginal distributions of one or a few variables. This task is usually
referred to as inference. Through marginalization it is possible to compute conditionals
and posteriors, and to make predictions. Inference is also a sub-routine of learning
tasks, and is therefore the most elementary sub-routine of graphical models. However, as
proven by (Cooper, 1990), exact inference is NP-hard in general. There are several
methods that work directly with the structure of graphical models and are in practice
orders of magnitude faster than explicitly manipulating the joint probability
distribution. The textbook (Koller and Friedman, 2009) provides an extensive discussion
of this topic, and describes the most popular methods, such as variable elimination,
Monte Carlo methods, and loopy belief propagation. Other recent methods are
tree-reweighted message passing (Wainwright et al., 2003), Power EP (Minka, 2004),
generalized belief propagation (Yedidia et al., 2004), and variational message passing
(Winn and Bishop, 2005). A free and open source library providing implementations
of various exact and approximate inference methods for graphical models was published
recently by (Mooij, 2010).

iii) Learning: a graphical model can be constructed either by a human expert or by
learning it automatically. There are many algorithms that model the probability
distribution of historical data, returning a graphical model as the solution. They
are useful since expert knowledge is not always enough to design a proper model.
Therefore, some authors consider these algorithms a tool for knowledge discovery.
Moreover, when constructing a model for a specific problem it is possible to use a
data-driven approach, in which part of the model is provided by an expert and the
details are filled in automatically by fitting the model to data. The significant number
of success stories in recent years has led some authors, such as Koller and Friedman in
their textbook, to claim that models produced by this process are usually much better
than purely hand-constructed ones.

This work reviews the specific problem of learning the independence structure of a
Markov network. This is an interesting problem that has resulted in important contributions
in recent years, although many of its core difficulties remain a challenge and are the
subject of intense work. This survey focuses on a technology called independence-based
learning, which allows the independence structure of those networks to be inferred from
data in an efficient and sound manner, provided that the data are sufficient and are a
representative sample of the target distribution. An analysis of the current
state-of-the-art algorithms for learning the structure of Markov networks with this
technology is presented, discussing their current limitations and their potential for
improving the quality and efficiency of current approaches.

The rest of the document is structured as follows. Section 2 presents an overview of
the Markov network representation. Section 3 discusses the problem of learning Markov
networks from data. Section 4 provides a review of current independence-based Markov
network structure learning algorithms. Section 5 analyzes the surveyed independence-based
algorithms and discusses their relative advantages and disadvantages, concluding with
a series of open problems that remain in the area. Finally, Section 6 presents concluding
remarks.

2 Markov networks representation 

This section gives an overview of the representation of a specific type of graphical
model: Markov networks. Graphical models in general consist of a qualitative and a
quantitative component for representing a probability distribution P. Such a distribution
is given over a domain of n variables, denoted $V = \{X_0, \ldots, X_{n-1}\}$. The
qualitative component is the independence structure G (also known as the network, or the
graph) of the model, which represents conditional independences among the domain
variables. The quantitative component is a set of numerical parameters $\theta$ that
quantify the relationships in G, given as a list of marginal probability distributions.

2.1 The independence structure 

The independence structure is a compact representation of the conditional independences
present in the underlying distribution P. Two variables X and Y are independent
conditioned on a set of variables Z when knowing the value of Y provides no new
information about X once the values of the variables in Z are already known. In this work
such a conditional independence is denoted as $(X \perp\!\!\!\perp Y \mid Z)$, and
conditional dependence as $(X \not\perp\!\!\!\perp Y \mid Z)$.
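
The definition can be checked numerically on a small joint table. The following Python
sketch is a minimal example, assuming a hypothetical distribution over three binary
variables in which variables 0 and 1 each depend only on variable 2. It tests the
equivalent identity Pr(x, y, z) Pr(z) = Pr(x, z) Pr(y, z) for every configuration; the
function names and the probability values are illustrative only.

```python
from itertools import product


def marginal(joint, keep):
    """Sum the joint table over every variable whose index is not in keep."""
    table = {}
    for assignment, p in joint.items():
        key = tuple(assignment[i] for i in keep)
        table[key] = table.get(key, 0.0) + p
    return table


def is_cond_independent(joint, x, y, z, tol=1e-9):
    """Check whether variable x is independent of variable y given the list z."""
    z = list(z)
    p_xyz = marginal(joint, [x, y] + z)
    p_xz = marginal(joint, [x] + z)
    p_yz = marginal(joint, [y] + z)
    p_z = marginal(joint, z)
    for key, p in p_xyz.items():
        vx, vy, vz = key[0], key[1], key[2:]
        # Pr(x, y, z) Pr(z) must equal Pr(x, z) Pr(y, z) everywhere.
        if abs(p * p_z[vz] - p_xz[(vx,) + vz] * p_yz[(vy,) + vz]) > tol:
            return False
    return True


def common_cause_joint():
    """Hypothetical joint: variables 0 and 1 are noisy copies of variable 2."""
    joint = {}
    for x0, x1, x2 in product((0, 1), repeat=3):
        p = 0.5
        p *= 0.9 if x0 == x2 else 0.1
        p *= 0.8 if x1 == x2 else 0.2
        joint[(x0, x1, x2)] = p
    return joint


joint = common_cause_joint()
independent_given_2 = is_cond_independent(joint, 0, 1, [2])
independent_marginally = is_cond_independent(joint, 0, 1, [])
```

In this example variables 0 and 1 are dependent when nothing is known, but become
independent once the value of variable 2 is given.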




Figure 3: Two example undirected independence structures. (a) An irregular graph whose
nodes have different degrees of connectivity. (b) A regular lattice where the variables
belong to a domain in a spatial problem.



The structure G of a Markov network is an undirected graph with n nodes, each one
representing a random variable of the domain. The edges of the graph encode the
conditional independences among the variables. Figure 3 shows two example undirected
structures, both representing domains with n = 12 variables
$V = \{X_0, \ldots, X_{11}\}$. The first, in Figure 3 (a), is an irregular graph whose
nodes have different degrees of connectivity. The second, in Figure 3 (b), is a regular
lattice where the variables belong to a domain in a spatial problem, as commonly used for
representing 2D images or two-dimensional Ising spin glass models (mathematical models of
ferromagnetism in statistical mechanics).
The independence structure is a map of the independences in the underlying distribution,
and such independences can be read from the graph through vertex separation,
considering that each variable is conditionally independent of all its non-neighbor
variables in the graph, given the set of its neighbor variables. For example, in
Figure 3 (a) variables $X_0$ and $X_3$ are conditionally independent given the set of
variables $\{X_1, X_2\}$. In the toroidal lattice of Figure 3 (b), $X_5$ is conditionally
independent of all its non-adjacent variables given its neighbor variables
$\{X_1, X_4, X_6, X_9\}$.
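
Reading independences by vertex separation amounts to a reachability test. The sketch
below is a minimal Python example, assuming that the lattice of Figure 3 (b) has 3 rows
and 4 columns with nodes numbered row by row; this layout is an assumption of the example,
under which the neighbors of node 5 are nodes 1, 4, 6 and 9. The function names are
hypothetical.

```python
from collections import deque


def toroidal_lattice(rows, cols):
    """Adjacency sets of a lattice whose borders wrap around (a torus)."""
    adjacency = {}
    for r in range(rows):
        for c in range(cols):
            adjacency[r * cols + c] = {
                ((r - 1) % rows) * cols + c,  # up
                ((r + 1) % rows) * cols + c,  # down
                r * cols + (c - 1) % cols,    # left
                r * cols + (c + 1) % cols,    # right
            }
    return adjacency


def is_separated(adjacency, x, y, z):
    """Return True if every path from x to y passes through a node in z."""
    blocked = set(z)
    if x in blocked or y in blocked:
        raise ValueError("x and y must not belong to the separating set")
    visited = {x}
    queue = deque([x])
    while queue:
        node = queue.popleft()
        if node == y:
            return False  # found a path that avoids the separating set
        for neighbor in adjacency[node]:
            if neighbor not in visited and neighbor not in blocked:
                visited.add(neighbor)
                queue.append(neighbor)
    return True


lattice = toroidal_lattice(3, 4)  # assumed layout of the 12-variable lattice
neighbors_of_5 = lattice[5]
```

With the full neighbor set as the conditioning set, node 5 is separated from every other
node, whereas a partial set such as {1, 4} leaves open paths through nodes 6 and 9.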

2.1.1 Correctness of the structure 

For a Markov network to represent a probability distribution P correctly, G must be
a map of the independences present in P. As defined in (Pearl, 1988), a graph G is called
an independence-map (or I-map, for short) of a distribution P when all the independences
encoded in the graph hold in the underlying distribution P.

Definition 1. I-map (Pearl, 1988) [p. 92]

A graph G is an I-map of a distribution P if for all disjoint subsets of variables X, Y
and Z, the following is satisfied:

$$(X \perp\!\!\!\perp Y \mid Z)_G \Longrightarrow (X \perp\!\!\!\perp Y \mid Z)_P, \tag{1}$$

where $(X \perp\!\!\!\perp Y \mid Z)_G$ are the independences encoded by G, and
$(X \perp\!\!\!\perp Y \mid Z)_P$ are the independences that hold in the underlying
distribution P.

Similarly, G is a dependency-map (D-map) when

$$(X \perp\!\!\!\perp Y \mid Z)_G \Longleftarrow (X \perp\!\!\!\perp Y \mid Z)_P. \tag{2}$$

When G is an I-map it is guaranteed that nodes found to be separated correspond to
independent variables, but it is not guaranteed that all those shown to be connected are
dependent. Conversely, when G is a D-map it is guaranteed that the nodes connected in G
are dependent in the distribution P. Fully connected graphs are trivial I-maps, and empty
graphs are trivial D-maps. A graph G is said to be a perfect-map of P if it is both
an I-map and a D-map.

An axiomatic characterization of the family of relations that are isomorphic to vertex
separation in graphs is given by the concept of graph-isomorphism. Basically, a distribution
P is a graph-isomorph when the independences among its variables can be encoded by an
undirected graph.

Definition 2 (p. 93). A distribution P is said to be a graph-isomorph if there exists an
undirected graph G that is a perfect-map of P, i.e., for every three disjoint subsets X, Y
and Z, we have

$$(X \perp\!\!\!\perp Y \mid Z)_G \Longleftrightarrow (X \perp\!\!\!\perp Y \mid Z)_P. \tag{3}$$
A necessary and sufficient condition for a distribution P to be a graph-isomorph is that
$(X \perp\!\!\!\perp Y \mid Z)_P$ satisfies the following axioms of independence,
introduced in 1985 by Pearl and Paz (Pearl and Paz, 1985). There is another set of axioms
for learning Bayesian networks, but they are omitted here.

$$
\begin{aligned}
\text{Symmetry} \quad & (X \perp\!\!\!\perp Y \mid Z) \Leftrightarrow (Y \perp\!\!\!\perp X \mid Z) \\
\text{Decomposition} \quad & (X \perp\!\!\!\perp Y \cup W \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z) \;\&\; (X \perp\!\!\!\perp W \mid Z) \\
\text{Transitivity} \quad & (X \perp\!\!\!\perp Y \mid Z) \Rightarrow (X \perp\!\!\!\perp A \mid Z) \text{ or } (A \perp\!\!\!\perp Y \mid Z) \\
\text{Strong union} \quad & (X \perp\!\!\!\perp Y \mid Z) \Rightarrow (X \perp\!\!\!\perp Y \mid Z \cup W) \\
\text{Intersection} \quad & (X \perp\!\!\!\perp Y \mid Z \cup W) \;\&\; (X \perp\!\!\!\perp W \mid Z \cup Y) \Rightarrow (X \perp\!\!\!\perp Y \cup W \mid Z),
\end{aligned}
\tag{4}
$$

where X, Y, Z and W are all disjoint subsets of the set of all the variables in the domain
V, and A stands for a single variable not in $X \cup Y \cup Z \cup W$. The intersection
axiom is valid only for strictly positive probability distributions. This list of axioms
represents the relationships that hold among the independences encoded by the graph.

In summary, when the distribution P is a graph-isomorph, there exists a graph G that is a
perfect-map of P. Any graph G that is an I-map of P may be used to represent P. However,
the more independences of the underlying distribution are encoded in the graph, the better
the model is in terms of complexity and accuracy when used for inference. Assuming
graph-isomorphism is an important decision, since not every distribution can be
represented by an undirected graph. For example, some distributions can be represented by
a directed acyclic graph, in which case Bayesian networks are the correct model to use,
and other distributions cannot be encoded by any graph.

2.1.2 The Markov blanket concept 

This section describes the concept of the Markov blanket, a central theoretical concept in
the representation of distributions, introduced by Pearl in 1988 (Pearl, 1988). The Markov
blanket of a variable is the only knowledge needed to predict the behavior of that
variable. Hence, this concept is relevant for a wide variety of applications where the
local relationships of some variables are significant.

Definition 3. Markov blanket

The Markov blanket of a variable X is a minimal set, denoted here $MB_X$, conditioned
on which all other nodes are independent of X, that is,

$$\forall Y \in V - \{X\} - MB_X, \quad (X \perp\!\!\!\perp Y \mid MB_X). \tag{5}$$

That is, the Markov blanket of a variable is the smallest set of variables that shields it
from the probabilistic influence of the variables not in the blanket. From a graphical
point of view, the Markov blanket of a variable X is identical to the set of its neighbors
in the graph.

The textbook of (Pearl, 1988) formally proves that, for strictly positive distributions,
the independence structure can be constructed by piecing together the Markov blankets of
all the variables of the domain, connecting with an edge every two variables X and Y such
that X belongs to the Markov blanket of Y. It also proves that every variable $X \in V$ in
a distribution that satisfies Pearl's axioms shown in Equation (4) has a unique Markov
blanket. Since the Intersection axiom is only guaranteed to hold for strictly positive
distributions, this result is valid only for such distributions.
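
The construction of the structure from the blankets can be sketched in a few lines of
Python. This is only an illustration under the assumption that the Markov blanket of every
variable is already known; the function name and the example blankets (a chain of four
variables) are hypothetical.

```python
def structure_from_blankets(blankets):
    """Connect X and Y with an edge whenever Y is in the Markov blanket of X."""
    edges = set()
    for variable, blanket in blankets.items():
        for member in blanket:
            # frozenset makes the edge undirected: {X, Y} equals {Y, X}.
            edges.add(frozenset((variable, member)))
    return edges


# Hypothetical blankets of a chain 0 - 1 - 2 - 3.
blankets = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2}}
edges = structure_from_blankets(blankets)
```

Because the edges are stored as unordered pairs, each pair of variables that appear in
each other's blanket yields a single undirected edge.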

2.2 Parameterization 

This section explains how to quantify the relationships encoded in G. Although this work
only addresses the problem of structure learning, the quantitative aspects of Markov
networks are briefly explained to better motivate it. Below is described a factorization
method for constructing the Gibbs distribution for an arbitrary undirected graph G,
provided in (Pearl, 1988), as follows:

i) Identify the maximal subgraphs whose nodes are all adjacent to each other,
called the maximal cliques of G. For example, the graph in Figure 3 (a) shows a
maximal clique of size 4 among the nodes corresponding to variables {7, 8, 10, 11},
two maximal cliques of size 3 among nodes {2, 3, 5} and {8, 9, 11}, and the rest of
the edges are maximal cliques of size 2. In Figure 3 (b) the size of all the cliques is 2.

ii) For each clique $c \in C$, where C is the set of all the cliques in the graph, assign
a non-negative potential function $g_c(X_c)$ (where $X_c$ is the set of variables that
belong to the clique c) measuring the relative degree of compatibility associated with
each possible configuration of $X_c$. Usually each potential function is represented by a
table with a numerical parameter for each possible complete assignment of the variables
that compose the clique, like the tabular model shown in Figure 1, but including only
the variables that compose the clique c. A difference from the tabular model is that
here the parameter values are not normalized.

iii) Form the product $\prod_{c \in C} g_c(X_c)$ of the potential functions over all the
cliques.

iv) Construct the Gibbs distribution by normalizing the product over all possible value
combinations of the variables in the system

$$P(X_0, \ldots, X_{n-1}) = \frac{1}{Z} \prod_{c \in C} g_c(X_c), \tag{6}$$

where Z is the partition function, or normalization constant, computed as

$$Z = \sum_{X_0, \ldots, X_{n-1}} \prod_{c \in C} g_c(X_c). \tag{7}$$

Using the Hammersley-Clifford theorem it is possible to prove that the general form of 
the Gibbs distribution of Equation (6) embodies all the conditional independences encoded 
in the graph G. Such form of the Gibbs distribution presents some difficulties. First, it is 
difficult to discern the meaning of the potential functions. Second, the computational cost 
of calculating the partition function Z is exponential, as it requires an exponential sum 
over all possible assignments of the complete set of variables. 
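
The factorization steps and the cost of the partition function can be illustrated with a
brute-force Python sketch. It assumes binary variables and a hypothetical chain of three
variables with cliques (0, 1) and (1, 2); the potential function `agree` and all names are
invented for this example. The loop over all assignments is exactly the exponential sum of
Equation (7), so this approach is feasible only for very small n.

```python
from itertools import product
from math import prod


def gibbs_distribution(n, potentials):
    """Normalize the product of clique potentials over n binary variables.

    potentials maps each clique (a tuple of variable indices) to a function
    that returns a non-negative number for the values of that clique.
    """
    unnormalized = {}
    for assignment in product((0, 1), repeat=n):  # 2**n assignments
        unnormalized[assignment] = prod(
            g(*(assignment[i] for i in clique))
            for clique, g in potentials.items()
        )
    z = sum(unnormalized.values())  # partition function
    distribution = {a: value / z for a, value in unnormalized.items()}
    return distribution, z


def agree(a, b):
    """Hypothetical potential that favors equal values in a clique."""
    return 2.0 if a == b else 1.0


potentials = {(0, 1): agree, (1, 2): agree}
distribution, z = gibbs_distribution(3, potentials)
```

Note that the potentials themselves are not probabilities: only after dividing by Z do the
values add up to 1.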

3 The Markov networks learning problem 

This section discusses the difficulties that arise in the challenging task of learning
Markov networks from historical information. This task is only possible when the input
dataset D is large enough and the data are a representative sample of the underlying
distribution P. When these conditions are satisfied, an algorithm can learn a model
representing P by exploring and analyzing D. The input dataset D contains historical
information commonly structured in the tabular format, a standard format in machine
learning. This is a file that contains a table with one column per random variable in P,
and whose rows are the datapoints, each one being a complete assignment of all the
variables. For example, a datapoint for a domain with n = 4 binary random variables
$V = \{X_0, X_1, X_2, X_3\}$ may be $(X_0 = 0, X_1 = 1, X_2 = 1, X_3 = 0)$. The algorithms
discussed in this work ignore the problem of missing values, which is solved by known yet
computationally challenging statistical techniques.

Learning a Markov network from data consists of learning both the structure G and the
parameters $\theta$. Of course, the best possible learned structure is a perfect-map, that
is, a structure encoding all the dependences and independences present in P. However,
every model whose structure is an I-map of P is a valid solution. The closer the learned
structure is to a perfect-map, the better the resulting Markov network represents P. When
learning a model for large domains, a desirable property of the model is sparsity, since
densely connected models require too many parameters and make exact and even approximate
inference computationally intractable.

3.1 Goals of Markov networks learning 

For evaluating the merits of a model learning method, it is important to consider the goal
of learning. Clearly, learning the complete model (structure plus parameters) is the best
outcome, but due to computational, spatial or sampling limitations it may not be possible
in practice. For that reason, other less ambitious goals are often considered, such as the
three main goals of learning discussed by Koller and Friedman (2009):

i) Density estimation: A common reason for learning a Markov network is to use
it for some inference task. When the goal of learning is formulated as density
estimation, the aim is to construct a model M such that the distribution it defines is
"close" to the underlying distribution P. A common metric for evaluating the quality
of such an approximation is the likelihood of the data $\Pr(D \mid M)$. However,
this goal assumes that the overall distribution P is needed.

ii) Specific prediction tasks: The goal is to predict the distribution of a particular
set of variables Y, given a certain set of variables X. When the model is used only
to perform a particular task and is never evaluated on predictions of the variables X,
it is better to optimize the learning task to improve the quality of its answers about
Y. This has been the goal of a large fraction of the work in machine learning. For
example, consider the problem of document classification, given a set of relevant words
of a document and a variable that labels the topic of the document. Another well-known
example is the task of image segmentation, where the goal is to predict the class labels
of all the pixels in the image, given the image features.

iii) Knowledge discovery: The goal is to learn the correct structure of the underlying
distribution. In some cases the learned structure can reveal important unknown
properties of the domain. This is a very different motivation for learning the
distribution. An examination of the learned structure can show dependences among
variables, such as positive or negative correlations. In a knowledge discovery
application, it is far more critical to assess the confidence in a prediction, taking into
account the extent to which it can be identified given the available data and the
number of hypotheses that would cause similar observed behavior. For example, in
a medical diagnosis domain, we may want to learn the structure of the model to
discover which predisposing factors lead to certain diseases and which symptoms are
associated with different diseases.

3.2 Parameters estimation 

Markov network parameter estimation is usually used to choose the values of the parameters
by fitting the model to data, because tuning parameters manually is often difficult, and
learned models often exhibit better performance. This task was shown to be NP-hard by
(Barahona, 1982).

For estimating the parameters, the most common method is maximum-likelihood estimation,
possibly with some regularization acting as a prior on the parameters. Unfortunately,
evaluating the likelihood of a complete model requires, for every set of parameters
proposed during the maximum-likelihood estimation process, the computation of the
partition function Z, which normalizes the product over all possible value combinations of
the variables of the domain, as shown in Equation (7). Although the likelihood cannot be
maximized in closed form, it is guaranteed that the global optimum can be found, because
the objective is a concave function. As a result, the literature offers approximations and
heuristics for reducing the cost of parameter estimation that use iterative methods such
as simple gradient ascent or more sophisticated optimization algorithms (Minka, 2001;
Vishwanathan et al., 2006). Unfortunately, this problem remains intractable in practice,
because the partition function couples all the parameters across the network, requiring
several inference steps on the network (iterative methods with interleaved inference).
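
To make the coupling explicit, the log-likelihood can be written out from Equation (6).
In the sketch below, the dataset size N and the notation $x_c^{(j)}$ for the values that
the j-th datapoint assigns to the variables of clique c are introduced only for this
illustration, and the datapoints are assumed to be independent and identically
distributed.

```latex
\log \Pr(D \mid M)
  = \sum_{j=1}^{N} \log \left( \frac{1}{Z} \prod_{c \in C} g_c\big(x_c^{(j)}\big) \right)
  = \sum_{j=1}^{N} \sum_{c \in C} \log g_c\big(x_c^{(j)}\big) \; - \; N \log Z
```

The first term decomposes over cliques and datapoints, but the term $N \log Z$ depends on
every potential function at once, so changing any single parameter requires recomputing
(or approximating) Z.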

To reduce the cost of parameter estimation, other solutions have been proposed.
Pseudolikelihood (Besag, 1977) and Score Matching (Hyvärinen and Dayan, 2005) are
tractable approximate alternatives. Loopy belief propagation (Pearl, 1988; Yedidia et al.,
2005) and its variants (Wainwright and Jordan, 2008) are approximate inference techniques
that can be used for approximating the gradient of the maximum-likelihood function.
However, as this solution can be highly non-robust, an alternative that outperforms loopy
belief propagation is provided in (Ganapathi et al., 2008).

To avoid overfitting, many of these scoring methods need a regularization term, which adds
an extra hyper-parameter whose best value has to be found empirically, for example by
running the training stage for several values of the hyper-parameter, potentially with
cross-validation.

3.3 Structure learning approaches 

The two broad approaches for learning the structure of Markov networks from data are
the score-based and the independence-based approaches. The first is intractable in
practice, and the latter is efficient but presents quality problems. The two approaches
have been motivated by distinct learning goals (those described in Section 3.1).
Generally, score-based approaches may be better suited for the density estimation goal,
that is, for tasks where inferences or predictions are required. As explained below in
Section 3.3.1, score-based methods learn the complete Markov network (structure and
parameters). Markov networks are overwhelmingly used in such settings, for example image
segmentation, where there is a particular inference task in mind. Independence-based
approaches, instead, are better suited for the remaining goals, that is, for specific
prediction tasks and knowledge discovery. On one hand, independence-based algorithms are
commonly used for tasks such as feature selection for classification, since it is possible
to perform local discovery for a particular set of variables of interest (more details in
Section 3.3.2). On the other hand, independence-based algorithms are suited for knowledge
discovery tasks, that is, tasks where understanding the interactions among the variables
of a domain is of the greatest importance, or where the structure is viewed purely as a
predictive tool, for example in econometrics, psychology, or sociology.
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Since this work focuses on the independence-based approach for Markov networks 
structure learning methods, Sections 4 and 5 only discuss in detail the state-of-the-art 
independence-based algorithms. 

3.3.1 Score-based approach 

Score-based algorithms were proposed in the mid-1990s for learning the structure of Bayesian
networks, in the works of (Lam and Bacchus, 1994) and (Heckerman et al., 1995), and were later
proposed for learning the structure of Markov networks, in the works of (Della Pietra et al.,
1997) and (McCallum, 2003). Such algorithms approach the problem as an optimization
over the space of complete models, looking for the model with the maximum score.
Traditional score-based algorithms for learning Markov network structure perform a global
search to learn a set of potential functions that accurately captures the high-probability
regions of the instance space.

The standard score-based approach for learning the structure of Markov networks
is the algorithm of Della Pietra et al. (1997). This algorithm learns the structure by
inducing a set of potential functions from data. Its strategy is a top-down search,
that is, a general-to-specific search. The algorithm starts with a set of atomic potentials
(that is, just the variables of the domain). Then, it creates a set of candidate potentials
in two ways. First, each potential currently in the model is conjoined (i.e., associated)
with every other potential in the model. Second, each potential in the model is composed
with each atomic potential. Then, for efficiency reasons, the parameters are learned for each
candidate potential assuming that the parameters of all other potentials remain unchanged.
When setting the parameters, it uses Gibbs sampling for inference. Then, for each
candidate potential, the algorithm evaluates how much adding that potential would increase
the log-likelihood, which is the score used by this algorithm. The potential that maximizes
this measure is added. When no candidate potential improves the score of the model,
the procedure ends. Another algorithm using the same approach is proposed in (McCallum,
2003). It is the same algorithm as that of Della Pietra et al., but it performs an efficient
heuristic search over the space of candidate structures, automatically inducing the potentials
that most improve the conditional log-likelihood. However, such general-to-specific searches
are inefficient because they test many potential variations with no support in the data, and
because they are highly prone to local optima (Davis and Domingos, 2010).
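The candidate-generation step described above can be sketched as follows. This is only an illustration under a strong simplification: a potential is represented as the set of variable names it mentions (an assumption of this sketch), and the function name `candidate_potentials` is hypothetical. Parameter fitting, Gibbs sampling and scoring are left out.

```python
from itertools import combinations

def candidate_potentials(model, atomic):
    """Return new candidate potentials built from the current model.

    A potential is represented here as a frozenset of variable names,
    which is a simplification made for this sketch.
    """
    candidates = set()
    # First: conjoin each potential in the model with every other one.
    for p, q in combinations(model, 2):
        candidates.add(p | q)
    # Second: compose each potential in the model with each atomic potential.
    for p in model:
        for a in atomic:
            candidates.add(p | a)
    # Keep only candidates that are not already in the model.
    return candidates - set(model)

atomic = [frozenset({v}) for v in ("A", "B", "C")]
model = list(atomic)  # the search starts from the atomic potentials
cands = candidate_potentials(model, atomic)
```

Even in this toy form, the number of candidates grows quickly with the size of the model, which illustrates why many candidates with no support in the data end up being evaluated.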

Recently, other alternative approaches have been proposed. The approaches of (Lee et al.,
2006), (Höfling and Tibshirani, 2009), and (Ravikumar et al., 2010) couple parameter
learning and potential induction into a single step by using L1-regularization, which
forces most numerical parameters to be zero. They cast structure learning as an optimization
problem over a large initial set containing all the possible potentials of interest.
Then, after learning, model selection occurs by selecting those potentials with non-zero
parameters. For efficiency reasons, the approaches of Höfling and Tibshirani and of Ravikumar
et al. construct only pairwise networks (networks whose factorization involves only cliques of
size one or two). The algorithm of Lee et al., instead, can learn arbitrarily long potentials,
but in practice it has been evaluated only for inducing potentials of length two (that is, for
learning pairwise networks).
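The way L1-regularization produces exact zeros, and how model selection then keeps the non-zero potentials, can be illustrated with the soft-thresholding step that many L1 solvers apply. This sketch is not the optimizer of any of the cited works; the weights, the penalty `lam` and the function names are hypothetical.

```python
def soft_threshold(value, lam):
    """Proximal operator of lam * |value|: shrink toward zero, clipping at zero."""
    if value > lam:
        return value - lam
    if value < -lam:
        return value + lam
    return 0.0

def select_potentials(weights, lam):
    """Shrink every weight and keep only potentials whose weight stays non-zero."""
    shrunk = {name: soft_threshold(w, lam) for name, w in weights.items()}
    return {name: w for name, w in shrunk.items() if w != 0.0}

# Hypothetical weights of candidate pairwise potentials before shrinkage.
weights = {("A", "B"): 1.2, ("A", "C"): 0.05, ("B", "C"): -0.4, ("C", "D"): -0.08}
selected = select_potentials(weights, lam=0.1)
```

A complete solver would alternate gradient steps on the likelihood with this shrinkage step; only the potentials that survive define the edges of the learned structure.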

A recent alternative approach, called the Bottom-up Learning of Markov Networks (BLM)
algorithm, is proposed by (Davis and Domingos, 2010). BLM starts with each complete
training example as a long potential in the Markov network. Then, the algorithm iterates
through the potential set, generalizing each potential to match its k nearest previously
unmatched examples by dropping variables. When the new generalized potential improves
the score of the model, it is incorporated into the model. The loop ends when no generalization
can improve the score.

However, all these approaches are often slow for two reasons. First, the size of the
search space of structures is intractable in the number of variables. Second, evaluating
the score at each step requires the estimation of the
numerical parameters, which is an NP-hard task, as explained in Section 3.2.

3.3.2 Independence-based approach 

Independence-based (also known as constraint-based) algorithms work by performing a 
succession of statistical independence tests for discovering the independence structure of 
graphical models (Spirtes et al., 2000). These algorithms exploit the semantics of the in- 
dependence structure, casting the problem of structure learning as an instance of the con- 
straint satisfaction problem, where the constraints are the independences present in the 
input dataset (and therefore, in the underlying distribution), and the goal is to find a 
structure encoding all such independences. 

Each independence test consults the data to answer a query about the conditional
independence between two input random variables X and Y, given a conditioning
set of variables Z, resulting in an independence assertion $(X \perp Y \mid Z)$, or in a
dependence assertion $(X \not\perp Y \mid Z)$. The computational cost of a statistical test is
proportional to the number of rows in the input dataset D, and to the number of variables
involved. Examples of independence tests used in practice are Mutual Information (Cover and
Thomas, 1991), Pearson's $\chi^2$ and $G^2$ (Agresti, 2002), the Bayesian test (Margaritis,
2005), and, for continuous Gaussian data, the partial correlation test (Spirtes et al., 2000).
Such independence tests compute a statistic for a triplet (X, Y, Z), given an input dataset,
and decide independence or dependence by comparing it with a threshold. For instance, $\chi^2$
and $G^2$ use the p-value, which is the probability of obtaining a test statistic at
least as extreme as the one actually observed, assuming that the null hypothesis
is true (that is, that the variables are independent). The null hypothesis is rejected, and
dependence is asserted, when the p-value is less than the significance level $\alpha$, which
is often 0.05 or 0.01. When the null hypothesis is rejected, the result is said to be
statistically significant.
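The decision procedure described above can be sketched for discrete data with a $G^2$ test. The code below is a minimal illustration, not the implementation used by any surveyed algorithm: the function names, the row format (a list of dicts) and the way degrees of freedom are counted (using only the conditioning configurations observed in the data) are assumptions of this sketch.

```python
import math
from collections import Counter

def chi2_sf(x, df):
    """Survival function of the chi-square distribution for integer df >= 1."""
    if x <= 0:
        return 1.0
    if df % 2 == 0:
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= (x / 2) / i
            total += term
        return math.exp(-x / 2) * total
    total, term = 0.0, math.sqrt(x)
    for i in range(1, (df - 1) // 2 + 1):
        total += term
        term *= x / (2 * i + 1)
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 / math.pi) * math.exp(-x / 2) * total

def g2_test(rows, x, y, z=(), alpha=0.05):
    """Return (independent, p_value) for the query 'x independent of y given z'."""
    n_xyz = Counter((r[x], r[y], tuple(r[v] for v in z)) for r in rows)
    n_xz = Counter((r[x], tuple(r[v] for v in z)) for r in rows)
    n_yz = Counter((r[y], tuple(r[v] for v in z)) for r in rows)
    n_z = Counter(tuple(r[v] for v in z) for r in rows)
    # G^2 = 2 * sum of O * ln(O / E) over observed cells, with E = N_xz * N_yz / N_z.
    g2 = 2.0 * sum(
        o * math.log(o * n_z[zv] / (n_xz[xv, zv] * n_yz[yv, zv]))
        for (xv, yv, zv), o in n_xyz.items()
    )
    x_card = len({r[x] for r in rows})
    y_card = len({r[y] for r in rows})
    df = max(1, (x_card - 1) * (y_card - 1) * len(n_z))
    p_value = chi2_sf(g2, df)
    # Independence is the null hypothesis: keep it unless the p-value is below alpha.
    return p_value >= alpha, p_value

dependent_rows = [{"X": v, "Y": v} for v in (0, 1) for _ in range(20)]
independent_rows = [{"X": a, "Y": b} for a in (0, 1) for b in (0, 1) for _ in range(10)]
dep_result = g2_test(dependent_rows, "X", "Y")
ind_result = g2_test(independent_rows, "X", "Y")
```

Note that the number of cells counted in `n_xyz` grows exponentially with the size of the conditioning set, which is why tests with many variables need much more data to be reliable.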

An elegant, efficient and scalable strategy used by several independence-based algorithms
in the literature is the local-to-global (LGL) strategy, presented in a recent work of
(Aliferis et al., 2010b). It is a generalization of previous algorithms that use such a strategy.
Algorithm 1 shows the outline of this theoretically sound and straightforward procedure.

Algorithm 1 LGL for Markov networks

1: Learn $MB^{X_i}$ for every variable $X_i \in V$.

2: Piece together the global structure using an "OR rule".



This strategy constructs the independence structure by dividing the problem
into n different Markov blanket learning problems, one per variable. The learning of Markov
blankets is generalized by Aliferis et al., for learning Bayesian networks, in the Generalized
Local Learning (GLL) framework (Aliferis et al., 2010a). Algorithms using a local-to-global
strategy learn locally the Markov blanket of every variable in the domain, and then construct
a global structure by linking each variable with every member of its Markov blanket using an
"OR rule" (an edge exists between two variables X and Y when $X \in MB^{Y}$ or $Y \in MB^{X}$).

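As a small illustration of the second step of the local-to-global strategy, the sketch below pieces together a global undirected structure from locally learned Markov blankets using the "OR rule". The function name `or_rule` and the dictionary representation of the blankets are assumptions of this sketch.

```python
def or_rule(blankets):
    """Build undirected edges: X - Y exists when X is in MB(Y) or Y is in MB(X)."""
    edges = set()
    for x, mb in blankets.items():
        for y in mb:
            edges.add(frozenset((x, y)))
    return edges

# Hypothetical local results; the blanket learned for C wrongly misses B.
blankets = {"A": {"B"}, "B": {"A", "C"}, "C": set()}
edges = or_rule(blankets)
```

In this example the edge between B and C is still recovered, because B's blanket contains C even though C's blanket does not contain B.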
Independence-based algorithms arose in 1993 for learning Bayesian networks, when
Spirtes et al. published, in the first edition of the (Spirtes et al., 2000) textbook, the
well-known SGS and PC algorithms. Later, other independence-based algorithms appeared
in works about feature selection via the induction of Markov blankets, and in works about
Bayesian and Markov network structure learning. As a result, a series of independence-based
algorithms for Markov blanket learning in Bayesian networks appeared, such as the
Koller-Sahami (KS) algorithm (Koller and Sahami, 1996), the Grow-Shrink (GS) algorithm
(Margaritis and Thrun, 2000), the Incremental Association Markov Blanket (IAMB) algorithm
and its variants (Tsamardinos et al., 2003), the Max-Min Parents and Children
Markov Blanket (MMPC/MB) algorithm (Tsamardinos et al., 2006), the HITON-PC/MB
algorithm (Aliferis et al., 2003), the Fast-IAMB algorithm (Yaramakala and Margaritis,
2005), the Parent-Children Markov Blanket (PCMB) algorithm (Peña et al., 2007) and
the Iterative Parent and Children Markov Blanket (IPC-MB) algorithm (Fu and Desmarais,
2008). A summary of the most important aspects of these algorithms is shown in Table 1,
reproduced from the conclusions of a recent review of Markov blanket based feature selection
written by Fu and Desmarais (2010).

Independence-based algorithms for learning Markov network structure arose in 2006,
when (Bromberg et al., 2006, 2009) published the Grow-Shrink Markov Network (GSMN)
algorithm and the Grow-Shrink Inference-based Markov Network (GSIMN) algorithm. Later,
other independence-based algorithms appeared for Markov network structure learning,
such as the Particle Filter Markov Network (PFMN) algorithm (Bromberg and Margaritis,
2007; Margaritis and Bromberg, 2009), and the Dynamic Grow Shrink Inference-based
Markov Network (DGSIMN) algorithm (Gandhi et al., 2008). Another approach is proposed
in (Bromberg, 2007; Bromberg and Margaritis, 2009), as a framework based on argumentation
for improving the reliability of tests. In Section 4 all these independence-based algorithms
for learning the structure of Markov networks are surveyed in detail.






Table 1: Summary of Markov blanket learning algorithms for Bayesian networks.

Name              Pub. Year   Comments

KS                1996        • Not sound
                              • The first one of this type
                              • Requires specifying MB size in advance

GS                1999        • Sound in theory
                              • Proposed to learn Bayesian networks via the induction of
                                the neighbors of each variable
                              • The first proved algorithm of this kind
                              • Works in two phases: grow and shrink

IAMB and          2003        • Sound in theory
its variants                  • Actually a variant of GS
                              • Simple to implement
                              • Time efficient
                              • Very poor on data efficiency
                              • IAMB's variants achieve better performance on data
                                efficiency than IAMB

MMPC/MB           2003        • Not sound
                              • The first to make use of the underlying topology information
                              • Much more data efficient compared to IAMB
                              • Much slower compared to IAMB

HITON-PC/MB       2003        • Not sound
                              • Another trial to make use of the topology information to
                                enhance data efficiency
                              • Data efficient compared to IAMB
                              • Much slower compared to IAMB

Fast-IAMB         2005        • Sound in theory
                              • No fundamental difference as compared to IAMB
                              • Adds candidates more greedily to speed up the learning
                              • Still poor on data efficiency performance

PCMB              2006        • Sound in theory
                              • Data efficient by making use of topology information
                              • Poor on time efficiency
                              • Distinguishes spouses from parents/children
                              • Distinguishes some children from parents/children

IPC-MB            2008        • Sound in theory
                              • Most data efficient compared with previous ones
                              • Much faster than PCMB on computing
                              • Distinguishes spouses from parents/children
                              • Distinguishes some children from parents/children
                              • Best trade-off among this family of algorithms

There are several advantages of independence-based algorithms. First, they can learn
the structure without interleaving the expensive task of parameter estimation (contrary to
score-based algorithms, as explained before), sometimes reaching polynomial complexity
in the number of statistical tests. If the complete model is required, the parameters can
be estimated only once for the given structure. Another important advantage of such
algorithms is that they are sound, that is, when the outcomes of the statistical tests are
correct, the structure found correctly represents the underlying distribution. However, they
are correct only under the following assumptions:

i) the distribution of the data is graph-isomorph

ii) the underlying distribution is strictly positive

iii) the outcomes of tests are reliable

The third condition for soundness is an important problem of independence-based
algorithms. When the dataset used for learning is not sufficiently large, the outcomes of
tests may be incorrect, and such tests are deemed unreliable. This problem is
exacerbated exponentially with the number of variables involved in a test (for a fixed
dataset size). For good quality, statistical tests require enough counts in the cells of their
contingency tables, and there are exponentially many cells (one per value assignment of all
the variables in the test). For example, Cochran (1954) recommends that the $\chi^2$ test be
deemed unreliable when more than 20% of these cells have an expected count of less than 5
data points.
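The rule of thumb quoted above is easy to check mechanically. The sketch below is only an illustration: the function names are hypothetical, and `uniform_expected_counts` assumes, for simplicity, that the rows spread evenly over all value assignments, which real data rarely do.

```python
def chi2_test_is_reliable(expected_counts, min_count=5.0, max_fraction=0.20):
    """Apply the 20% / 5 rule of thumb to a list of expected cell counts."""
    small = sum(1 for e in expected_counts if e < min_count)
    return small / len(expected_counts) <= max_fraction

def uniform_expected_counts(n_rows, cardinalities):
    """Expected count per cell if rows spread evenly over all value assignments."""
    n_cells = 1
    for c in cardinalities:
        n_cells *= c
    return [n_rows / n_cells] * n_cells

# 1000 rows of binary variables: X, Y and a conditioning set of growing size.
reliable_small = chi2_test_is_reliable(uniform_expected_counts(1000, [2] * 4))   # 16 cells
reliable_large = chi2_test_is_reliable(uniform_expected_counts(1000, [2] * 10))  # 1024 cells
```

With the same 1000 rows, a test over 4 binary variables passes the rule while a test over 10 binary variables fails it, which illustrates the exponential effect of the number of variables involved.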

Another disadvantage of independence-based algorithms is that there is no guarantee
about the quality of the complete model obtained by first learning the structure, and
then fitting the parameters for the learned structure. This is an approximation, and there
are no experimental results published in the literature about independence-based methods for
learning complete models.

4 Independence-based algorithms for learning the Markov networks structure

This section reviews the independence-based structure learning algorithms for Markov
networks that have been published in the literature.

4.1 The Grow-Shrink Markov Network algorithm 

The Grow-Shrink Markov Network (GSMN) algorithm was introduced by Bromberg et
al. (Bromberg et al., 2006, 2009) as the first independence-based structure learning
algorithm for Markov networks in the literature. It is an adaptation to Markov
networks of the GS algorithm of (Margaritis and Thrun, 2000) for learning the Markov
blanket.

The GSMN algorithm learns the global structure of a Markov network following the simple
outline of local-to-global algorithms shown in Algorithm 1, and using the GS algorithm
outlined in Algorithm 2 for discovering the Markov blanket of each variable.

Algorithm 2 GS(X, V).

1: $S \leftarrow \emptyset$.

2: sort $V - \{X\}$ by increasing association with X

3: /* Grow phase */

4: while $\exists Y \in V - \{X\}$ s.t. $(Y \not\perp X \mid S)$, do $S \leftarrow S \cup \{Y\}$.

5: /* Shrink phase */

6: while $\exists Y \in S$ s.t. $(Y \perp X \mid S - \{Y\})$, do $S \leftarrow S - \{Y\}$.

7: return S

GS maintains a set called S (initialized empty in line 1) that contains the Markov blanket
of the input variable X when the algorithm terminates. First, in line 2, GS performs an
initialization phase that sorts the rest of the variables of the domain by increasing
association with X, using an unconditional test between X and every variable
$Y \in V - \{X\}$. Then, the algorithm proceeds in two stages, the grow and shrink phases,
using this ordering. During the grow phase (line 4), the algorithm adds to S every variable Y
that is found dependent on X conditioning on the current state of S. By the end of this
phase, S contains all the members of the Markov blanket, but potentially also some false
positives that are non-members. These false positives are removed during the shrink phase
(line 6), where variables found independent of X conditioning on the remaining variables of S
are removed from S.
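The grow and shrink phases can be sketched in a few lines. The code below is a hedged illustration rather than the authors' implementation: the sorting step of line 2 is omitted (the given order of `variables` is used), and `independent(a, b, cond)` is a hypothetical callable that stands in for a statistical test. For the example, it is replaced by an oracle that reads independences off a known undirected graph through vertex separation.

```python
from collections import deque

def grow_shrink(x, variables, independent):
    """Return the Markov blanket of x, given an independence oracle or test."""
    blanket = []
    others = [v for v in variables if v != x]
    # Grow phase: add any variable found dependent on x given the blanket so far.
    changed = True
    while changed:
        changed = False
        for y in others:
            if y not in blanket and not independent(x, y, set(blanket)):
                blanket.append(y)
                changed = True
    # Shrink phase: remove variables independent of x given the rest of the blanket.
    changed = True
    while changed:
        changed = False
        for y in list(blanket):
            if independent(x, y, set(blanket) - {y}):
                blanket.remove(y)
                changed = True
    return set(blanket)

def separation_oracle(adjacency):
    """Oracle for a known graph: a and b are independent iff cond separates them."""
    def independent(a, b, cond):
        seen, queue = {a}, deque([a])
        while queue:
            node = queue.popleft()
            for nxt in adjacency[node]:
                if nxt == b:
                    return False
                if nxt not in seen and nxt not in cond:
                    seen.add(nxt)
                    queue.append(nxt)
        return True
    return independent

# Hypothetical chain A - B - C - D.
chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
oracle = separation_oracle(chain)
blankets = {v: grow_shrink(v, sorted(chain), oracle) for v in chain}
```

When learning the blanket of C with the variable order A, B, D, the grow phase first adds A (a false positive, since it is tested with an empty conditioning set), and the shrink phase later removes it once B is in the set.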

The main advantages of GSMN are that i) it is sound, and ii) it is efficient. The soundness
of GSMN is proven theoretically by its authors, guaranteeing that the correct independence
structure is found when the statistical tests are reliable. The algorithm is efficient because
it requires a polynomial number of independence tests for discovering the structure, each
test requiring a running time polynomial in the domain size and in the size of
the input dataset. A disadvantage of using GS is that unreliable statistical tests produce
cascading errors: an incorrect outcome is not only an error in itself, but it also changes
the tests performed next during the grow and shrink phases, so that errors accumulate
(Spirtes et al., 2000).

Two other important algorithms for learning the Markov blanket of a variable, for
Bayesian networks, are the Incremental Association Markov Blanket (IAMB) algorithm
(Tsamardinos et al., 2003), and the HITON algorithm (Aliferis et al., 2003). Both
algorithms have been shown empirically to be more robust than GS to the errors of statistical
tests, by introducing two simple variants. On the one hand, the IAMB algorithm only introduces
a modification that interleaves the initialization step of ordering with the grow phase (i.e.,
it interleaves lines 2 and 4 of Algorithm 2). By interleaving the sorting step with the grow
phase, IAMB improves accuracy by reducing the number of false positives in the grow phase.
On the other hand, the HITON algorithm aims to reduce the data requirements of IAMB,
by introducing an additional modification in the criterion used for testing independence. In
both the grow and shrink phases, instead of conditioning only on its tentative Markov blanket
S, HITON tests independence conditioning on any of the subsets of S (that is, every set
$Z \subseteq S - \{Y\}$). As statistical tests are more reliable when they involve fewer
variables, this modification exploits the Strong union axiom of Pearl to improve the quality
of independence tests when data is scarce. A disadvantage of the approach proposed by HITON
is its exponential cost in |S| (i.e., the size of S), but in general |S| is comparatively
smaller than the size of the domain n. In summary, both algorithms have been shown to be
better in quality than GS, but both were designed for learning the structure
of Bayesian networks, and there is no work in the literature proposing a theoretical
adaptation of these ideas for learning the complete structure of a Markov network and
evaluating its performance empirically.

4.2 The Grow Shrink Inference Markov Network algorithm 

The Grow Shrink Inference Markov Network (GSIMN) algorithm was presented by Bromberg
et al. (Bromberg et al., 2006, 2009). This algorithm works in a similar fashion to
GSMN, using the local-to-global strategy of Algorithm 1 and learning the
Markov blanket of every variable with the GS algorithm, but it interleaves an inference
step to reduce the number of tests required to learn the Markov blankets. By using for
inference a theorem that the authors call the Triangle theorem, GSIMN reduces the number of
tests performed on data without adversely affecting the quality of the learned structures.
This may be useful with large datasets, or in distributed domains, where statistical tests
are very expensive.

GSIMN introduces the Triangle theorem, based on Pearl's axioms shown in Section 2.1.1.
This sound theorem allows unknown independences to be inferred from those
known so far.

Theorem 4 (Triangle theorem). Given Eqs. (4), for every variable X, Y, W and sets $Z_1$
and $Z_2$ such that $\{X, Y, W\} \cap Z_1 = \{X, Y, W\} \cap Z_2 = \emptyset$,

$$(X \not\perp W \mid Z_1) \wedge (W \not\perp Y \mid Z_2) \implies (X \not\perp Y \mid Z_1 \cap Z_2)$$
$$(X \perp W \mid Z_1) \wedge (W \not\perp Y \mid Z_1 \cup Z_2) \implies (X \perp Y \mid Z_1).$$

The first relation is called the "D-triangle rule" and the second the "I-triangle rule."

When GSIMN needs the outcome of some independence test, it first applies the Triangle
theorem to the outcomes of the tests already performed on data, to check whether the
independence assertion can be logically inferred. If it cannot be inferred, the test is
performed on data and its outcome is stored. In addition, the algorithm chooses the visit
ordering (the order for local learning) in an attempt to maximize the use of inferences.
The results obtained with GSIMN show savings of up to 40% in the running times of GSMN,
with comparable quality.
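The inference step can be illustrated with a brute-force application of the two triangle rules to a store of known test outcomes. This is only a sketch under stated assumptions: the function name `infer`, the dictionary layout of `known`, and the exhaustive search over pairs of facts are choices made for this example, and GSIMN itself organizes the inference differently.

```python
def infer(x, y, z, known):
    """Try to infer the outcome of the test (x, y | z) via the Triangle theorem.

    `known` maps (frozenset({a, b}), frozenset(cond)) to True (independent)
    or False (dependent). Returns True/False when inferred, None otherwise.
    """
    z = frozenset(z)
    facts = [(pair, cond, ind) for (pair, cond), ind in known.items()]
    for a, b in ((x, y), (y, x)):  # the conclusion is symmetric in x and y
        for pair1, z1, ind1 in facts:
            if a not in pair1 or len(pair1) != 2:
                continue
            (w,) = pair1 - {a}
            if w == b or b in z1:
                continue
            for pair2, z2, ind2 in facts:
                # The second premise must be a dependence between w and b.
                if pair2 != frozenset((w, b)) or ind2 or a in z2:
                    continue
                # I-triangle rule: independence given z1, dependence given a superset.
                if ind1 and z1 == z and z1 <= z2:
                    return True
                # D-triangle rule: two dependences, conclusion given z1 intersect z2.
                if not ind1 and (z1 & z2) == z:
                    return False
    return None

dep_facts = {
    (frozenset("AC"), frozenset("D")): False,   # A dependent on C given {D}
    (frozenset("CB"), frozenset("DE")): False,  # C dependent on B given {D, E}
}
ind_facts = {
    (frozenset("AC"), frozenset("D")): True,    # A independent of C given {D}
    (frozenset("CB"), frozenset("DE")): False,  # C dependent on B given {D, E}
}
d_result = infer("A", "B", {"D"}, dep_facts)
i_result = infer("A", "B", {"D"}, ind_facts)
```

Whenever `infer` returns a value, the corresponding statistical test on data can be skipped; when it returns `None`, the test has to be performed and its outcome added to `known`.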

4.3 Particle Filter Markov networks algorithm 

The Particle Filter Markov networks (PFMN) algorithm is an independence-based
algorithm for learning Markov network structures, introduced in (Bromberg and Margaritis,
2007; Margaritis and Bromberg, 2009). The independence-based algorithms reviewed so far,
GSMN and GSIMN, use the local-to-global strategy. This algorithm, instead,
directly learns a global structure as the solution.

PFMN was designed for improving the efficiency of the GSIMN algorithm. It works by
performing statistical independence tests iteratively, greedily selecting
at each iteration the statistical test that eliminates the largest number of inconsistent
structures. This decision is taken by first modeling the learning problem with a Bayesian
approach, selecting as the solution the structure G that maximizes the posterior probability.
That is, given a dataset D, it maximizes the posterior over structures. Formally,

$$G^{*} = \arg\max_{G} \Pr(G \mid D). \qquad (8)$$

Since the direct computation of this probability is intractable, PFMN proposes a generative
model based on independence tests, which is an approximation to the posterior probability.
With this model it is possible to compute the probability efficiently, given the information
over a set of independences. Moreover, the authors claim that it is possible to demonstrate
that, under the assumption of correctness of tests, the distribution Pr(G | D) converges
to a correct structure.

This approach is useful in domains where independence tests are expensive, such as
cases of very large datasets or distributed domains. The results obtained by PFMN show
improvements in running times of up to 90% with respect to GSIMN, with structures of
quality comparable to those found by GSIMN and GSMN.

4.4 The Dynamic Grow Shrink Inference-based Markov Network algorithm

The Dynamic Grow Shrink Inference-based Markov Network (DGSIMN) algorithm was
presented in (Gandhi et al., 2008). It is an extension of the GSIMN algorithm which, in
the same way as GSIMN, uses the Triangle theorem for avoiding unnecessary tests. The
outline of DGSIMN is similar to that of GSMN and GSIMN: it uses the local-to-global strategy
of Algorithm 1 and the GS algorithm shown in Algorithm 2 for learning the Markov blanket
of each variable, but it interleaves an inference step different from that of GSIMN for
reducing the number of tests performed.

DGSIMN improves the GSIMN algorithm by dynamically selecting the locally optimal
test, the one that most increases the state of knowledge about the structure. To do so, it
estimates the number of independences that would be inferred after executing each test, and
selects the test that maximizes this number of inferences. This helps to decrease the number
of tests that must be evaluated on data, resulting in an overall decrease in the computational
requirements of the algorithm.

The results of experiments with the DGSIMN algorithm show that it improves on the fixed
ordering of variables in the Markov blanket learning subroutine, improving the running
times of GSIMN by up to 85%, with quality comparable to that of GSMN.



4.5 Argumentation for improving reliability 

Algorithms presented in previous sections are independence-based algorithms that focus on 
improving the efficiency, ignoring the important problem of the quality of learned structures, 
a problem that arises when statistical tests are not reliable, due to data scarceness. 

An independence-based approach for dealing with unreliable tests was presented by
Bromberg and Margaritis (Bromberg, 2007; Bromberg and Margaritis, 2009). It models
the problem of low reliability of independence tests as a knowledge base containing
independence assertions, which may contain errors due to incorrect statistical tests, together
with Pearl's axioms (the directed or the undirected axioms, depending on the target model to
learn). The advantage of this approach is its power for correcting errors of tests by logically
exploiting Pearl's independence axioms. When the knowledge base contains independence
assertions that are in conflict, it is clear that some of them are incorrect, and this
approach proposes to resolve such conflicts through argumentation (Amgoud and Cayrol,
2002), which is a defeasible logic used to reason about and correct errors.

This approach was presented as a more robust conditional independence test, called
the argumentative independence test, for learning Bayesian networks in (Bromberg, 2007;
Bromberg and Margaritis, 2009). The experimental evaluation shows significant improvements
in the accuracy of the argumentative independence test over simple statistical tests
(up to 13%), and improvements in the accuracy of independence-based algorithms such as
PC and GS (up to 20%). The approach was presented for learning Markov
networks in (Bromberg, 2007), adapting the learning process to use the set of Pearl's
axioms for Markov networks shown in Equation 4.

A disadvantage of this approach is that, as it is a propositional formalism, it requires
propositionalizing the set of rules of Pearl, which are first-order. As these are rules
over supersets and subsets of variables, their propositionalization requires an exponential
number of propositions, and therefore the exact argumentative algorithm proposed is
exponential. In that work an approximate solution with polynomial running time is also
presented, which still improves the quality in the experimental evaluation (up to 9%), but
makes a drastic approximation that does not provide theoretical guarantees.






5 Analysis and open problems 



This section analyzes the surveyed independence-based algorithms present in the literature 
for learning Markov networks, discussing their relative advantages as well as disadvantages 
from a theoretical viewpoint, and describes a series of open problems that remain in the 
area, and where future works may produce some advances. 

5.1 Analysis 

The independence-based algorithms for learning Markov networks are able to learn the
independence structure efficiently, with the important advantage of being sound, that is, they
are amenable to a proof of correctness when the data is a sample of a Markov network, the tests
are reliable, and the underlying distribution is strictly positive. Such algorithms perform
a succession of statistical independence tests to learn about the conditional independences
present in the data, and assume that those independences are satisfied in the underlying model.
Regarding complexity, they can learn the structure by performing a number of tests that is
polynomial in the number of variables of the domain, n. This fact, together with the fact that
statistical tests may run in time proportional to the number of rows in the input dataset
D, sometimes results in a total execution time polynomial in n and in the size of D. Another
source of efficiency of independence-based algorithms is the capability of learning the
independence structure without needing an interleaved estimation of the numerical parameters
of the model, which is the principal source of intractability of score-based algorithms for
Markov networks. However, there is no guarantee of correctness for Markov networks obtained
by fitting the parameters for a structure learned by an independence-based approach.

The independence-based algorithms present in the literature for learning the structure
of a Markov network are GSMN, GSIMN, PFMN, and DGSIMN. Another approach proposed
is the use of the argumentative independence test. Table 2 shows a summary of the most
important features of these approaches. The GSMN algorithm is a direct extension of the
GS algorithm to Markov network structure learning, and requires a number of tests that is
polynomial in the number of variables of the domain, n. This algorithm is presented
together with the GSIMN algorithm, which improves the efficiency of GSMN by exploiting
Pearl's independence axioms to infer unknown independences from the independences
observed so far, avoiding the need to perform redundant statistical tests. This is important
when datasets are large, or when datasets reside in distributed environments. The
results obtained for GSIMN show savings of up to 40% in running times, with quality
comparable to that of GSMN. The PFMN algorithm was designed for improving the efficiency
of GSIMN. This algorithm does not work in a local-to-global fashion; instead, it uses a model
for efficiently computing the posterior probability of structures Pr(G | D). The results
obtained by PFMN show improvements in running times of up to 90% with respect to GSIMN,
with equivalent quality of the learned structures. The DGSIMN algorithm was also designed
for improving the efficiency of GSIMN, by replacing the fixed ordering of variables in the
Markov blanket learning subroutine with a dynamic ordering mechanism. Experiments
published for DGSIMN show improvements over the running times of GSIMN of up to 85%, while
maintaining the quality of GSMN.

Table 2: Summary of independence-based Markov network learning algorithms

Name              Pub. Year   Comments

GSMN              2006        • Sound in theory
                              • The first independence-based algorithm for Markov networks
                              • Uses the local-to-global strategy
                              • Performs a polynomial number of tests, in the number of
                                variables of the domain n
                              • Quality depends on sample complexity of tests

GSIMN             2006        • Sound in theory
                              • Uses the local-to-global strategy
                              • Uses the Triangle theorem for reducing the number of tests
                                performed
                              • Useful when using large datasets, or distributed domains
                              • Savings up to 40% in running times with respect to GSMN
                              • Comparable quality with respect to GSMN

PFMN              2007        • Sound in theory
                              • Does not use the local-to-global strategy
                              • Designed for improving the efficiency of GSIMN
                              • Uses a generative model of the posterior Pr(G | D) using
                                independence tests
                              • Useful when using large datasets, or distributed domains
                              • Savings up to 90% in running times with respect to GSIMN
                              • Comparable quality with respect to GSMN and GSIMN

DGSIMN            2008        • Sound in theory
                              • Uses the local-to-global strategy
                              • Designed for improving the efficiency of GSIMN
                              • Uses dynamic ordering for reducing the number of tests
                                performed
                              • Useful when using large datasets, or distributed domains
                              • Savings up to 85% in running times with respect to GSIMN
                              • Comparable quality with respect to GSMN and GSIMN

Argumentative     2009        • Novel approach using argumentation to correct errors when
independence                    tests are unreliable
test                          • Uses an independence knowledge base. The inconsistencies
                                are used to detect errors in tests
                              • Designed for learning Bayesian and Markov networks
                              • Exact algorithm presented is exponential (improving
                                accuracy up to 13%)
                              • Approximate algorithm proposed does not provide theoretical
                                guarantees (improving accuracy up to 9%)

At this point, it is clear that the most important problem of independence-based
algorithms for learning the structure of Markov networks is the problem of quality when
statistical independence tests are not reliable. This problem is not tackled by GSMN,
GSIMN, DGSIMN, or PFMN. This is very important because in real-world domains it
is not possible to know whether the tests are reliable. The only approach presented for
improving the quality under uncertain test outcomes is the argumentative independence test.
Experimental results using this approach show significant improvements over the accuracy
of the standard independence tests, but the exact algorithms presented have an exponential
cost, and the approximate algorithm proposed, while still improving the quality, makes a
drastic approximation that does not provide theoretical guarantees.

In summary, the advantages of independence-based algorithms for learning Markov
networks are overshadowed by the low quality of independence tests when data is scarce, so
independence-based algorithms are currently not used in practice for learning
Markov networks. However, this approach has important advantages that
motivate further work in this area. First, independence-based algorithms are sound and
efficient. Second, data availability keeps growing over time. Third, there are
several open problems (enumerated in the next section) whose solutions could result in
significant improvements in the quality of this technology.

5.2 Open problems 

Following the analysis of the previous section, this work concludes by discussing a series of
open problems that remain in the area, and where future work may produce advances.
All the listed problems focus on the quality and the efficiency of the independence-based
approach for learning Markov networks.



Open problem 1. Improving the quality of GS. Most independence-based algorithms
surveyed (GSMN, GSIMN, DGSIMN) learn the Markov blanket of variables
using the GS algorithm. A source of errors in GS is the heuristic used for ordering,
which generates cascading errors when statistical tests are unreliable. Since the unreliability
of tests is exacerbated as the number of variables involved in a test increases, simple
variants of the GS algorithm might produce tests involving fewer variables,
improving the reliability of tests. This was demonstrated theoretically
and empirically by the IAMB and HITON algorithms for Bayesian networks.

What modifications to GS are appropriate for improving the quality of 
Markov network structure learning algorithms? 
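To make the role of the ordering concrete, the following is a minimal sketch of a grow-shrink Markov blanket search as GS is usually described, not the exact procedure of any surveyed algorithm. The independence test is passed in as a hypothetical callable `independent(x, y, z)`; here it is simulated by graph separation on a known chain A - B - C - D instead of a statistical test on data. The helper names (`separated`, `grow_shrink_blanket`, `max_conditioning_size`) are illustrative assumptions.

```python
from typing import Callable, Dict, FrozenSet, Iterable, List, Set, Tuple

# Hypothetical test signature: True when x is judged independent of y given z.
IndepTest = Callable[[str, str, FrozenSet[str]], bool]


def separated(adj: Dict[str, Set[str]], x: str, y: str, z: FrozenSet[str]) -> bool:
    """True when every path from x to y in the undirected graph crosses z."""
    seen, stack = {x}, [x]
    while stack:
        node = stack.pop()
        for nxt in adj[node]:
            if nxt == y:
                return False
            if nxt not in seen and nxt not in z:
                seen.add(nxt)
                stack.append(nxt)
    return True


def grow_shrink_blanket(x: str, order: Iterable[str],
                        independent: IndepTest) -> List[str]:
    """Estimate the Markov blanket of x, visiting candidates in `order`."""
    candidates = [v for v in order if v != x]
    blanket: List[str] = []

    # Grow phase: add every variable found dependent on x given the
    # current blanket, repeating until no more variables are added.
    changed = True
    while changed:
        changed = False
        for y in candidates:
            if y not in blanket and not independent(x, y, frozenset(blanket)):
                blanket.append(y)
                changed = True

    # Shrink phase: drop variables that are independent of x given the
    # rest of the blanket (false positives of the grow phase).
    for y in list(blanket):
        rest = frozenset(v for v in blanket if v != y)
        if independent(x, y, rest):
            blanket.remove(y)
    return blanket


def max_conditioning_size(x: str, order: List[str],
                          adj: Dict[str, Set[str]]) -> Tuple[List[str], int]:
    """Run the search and report the largest conditioning set it tested."""
    sizes: List[int] = []

    def oracle(a: str, b: str, z: FrozenSet[str]) -> bool:
        sizes.append(len(z))
        return separated(adj, a, b, z)

    return grow_shrink_blanket(x, order, oracle), max(sizes)


chain = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C"}}
print(max_conditioning_size("A", ["B", "C", "D"], chain))
print(max_conditioning_size("A", ["D", "C", "B"], chain))
```

With a perfect oracle both orderings recover the same blanket of A, but the second ordering conditions on a larger set along the way. With real statistical tests, those larger conditioning sets are the less reliable ones, which is the effect this open problem targets.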



Open problem 2. Independence-based quality measures. The PFMN algorithm
uses the particle filter approach for optimizing the selection of tests to perform. It
utilizes a generative model that computes the posterior probability of independence
structures given the data, by an approximate method. Interestingly, this posterior
probability can be computed efficiently, and could serve as a measure of quality for an
optimization method. Such a measure of quality has the advantage of avoiding cascading
errors by assigning probabilities to structures. This is an unexplored area for learning
the structure of Markov networks.

Is it possible to adapt the structure posterior computation of PFMN into
an efficient and sound score? Would the optimization of such a score
improve the quality of the structures learned?



Open problem 3. Speeding up independence-based algorithms. Learning the
structure with the independence-based approach requires in some cases the
execution of a massive number of statistical independence tests on data. An intermediate
step in the computation of an independence test is the construction of contingency tables
from the data, which record the frequency distribution of the variables involved in the
test. However, this requires reading the whole dataset, and for some problems its size is
too large.

Can the contingency tables of some tests be reused for inferring the
contingency tables of other tests? How can an independence-based
algorithm use such an inference mechanism to minimize the number of
full passes over the dataset? Under what conditions would this mechanism
generate gains in performance?
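One simple instance of such reuse is marginalization: a table over a set of variables determines the table over any subset of them. The sketch below illustrates only this special case and is not a mechanism proposed in the surveyed literature. It assumes discrete data stored as a list of dictionaries, and the function names are hypothetical.

```python
from collections import Counter
from typing import Dict, Hashable, List, Sequence, Tuple

Row = Dict[str, Hashable]
Table = Counter  # maps a tuple of values to its frequency


def contingency_table(rows: List[Row], columns: Sequence[str]) -> Table:
    """Count joint value configurations of `columns` in one pass over the data."""
    return Counter(tuple(row[c] for c in columns) for row in rows)


def marginalize(table: Table, columns: Sequence[str],
                keep: Sequence[str]) -> Table:
    """Derive the table over `keep` (a subset of `columns`) by summing out
    the remaining variables, without reading the dataset again."""
    positions = [list(columns).index(c) for c in keep]
    reduced: Table = Counter()
    for config, count in table.items():
        reduced[tuple(config[p] for p in positions)] += count
    return reduced


rows = [
    {"X": 0, "Y": 0, "Z": 0},
    {"X": 0, "Y": 1, "Z": 1},
    {"X": 1, "Y": 1, "Z": 0},
    {"X": 0, "Y": 0, "Z": 1},
]
full = contingency_table(rows, ["X", "Y", "Z"])    # one pass over the data
xy = marginalize(full, ["X", "Y", "Z"], ["X", "Y"])  # no further pass
print(xy == contingency_table(rows, ["X", "Y"]))
```

The cost of the derived table depends on the number of non-empty cells of the cached table rather than on the number of data rows, so the gain depends on how sparse the cached table is relative to the dataset size.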



Open problem 4. Inconsistencies in local-to-global algorithms. Independence-based
algorithms using the local-to-global strategy decompose the problem of learning a
complete independence structure with n variables into n independent Markov blanket
learning problems. In a second step, these algorithms piece together all the learned
Markov blankets into a global structure using an "OR rule". Insufficient data may
result in incorrect learning of Markov blankets, with conflicts in their decisions on edge
inclusion when, for two variables X and Y, X is in the blanket of Y but Y is not in
the blanket of X. In such cases the "OR rule" always decides to add the edge, making
a mistake whenever that edge does not exist.

How is it possible to design more robust rules for resolving inconsistencies
between two learned Markov blankets?
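The conflict described above can be illustrated with a small sketch. It assumes the learned blankets are given as a dictionary from each variable to a set of variables. The "AND rule" shown for contrast is only the obvious stricter alternative, included for illustration, and is not a rule proposed by the surveyed algorithms. All function names are hypothetical.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Edge = FrozenSet[str]


def blanket_conflicts(blankets: Dict[str, Set[str]]) -> List[Tuple[str, str]]:
    """List pairs (x, y) where y is in the blanket of x but x is not in
    the blanket of y, i.e. the two local decisions disagree."""
    conflicts = []
    for x in sorted(blankets):
        for y in sorted(blankets[x]):
            if x not in blankets.get(y, set()):
                conflicts.append((x, y))
    return conflicts


def combine_blankets(blankets: Dict[str, Set[str]], rule: str = "or") -> Set[Edge]:
    """Piece local blankets together into a set of undirected edges.

    rule="or":  add an edge if either blanket contains the other variable.
    rule="and": add an edge only if both blankets agree (illustrative only).
    """
    if rule not in ("or", "and"):
        raise ValueError("rule must be 'or' or 'and'")
    edges: Set[Edge] = set()
    for x, members in blankets.items():
        for y in members:
            symmetric = x in blankets.get(y, set())
            if rule == "or" or symmetric:
                edges.add(frozenset((x, y)))
    return edges


# X appears in the blanket of Y, but Y does not appear in the blanket of X.
learned = {"X": {"Z"}, "Y": {"X"}, "Z": {"X"}}
print(blanket_conflicts(learned))
print(sorted(sorted(e) for e in combine_blankets(learned, "or")))
print(sorted(sorted(e) for e in combine_blankets(learned, "and")))
```

Neither fixed rule is safe: the OR rule adds the disputed edge even when it is spurious, while the AND rule drops it even when it is real. This is why the question asks for rules that use more information than the two blankets alone.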






Open problem 5. Comparing independence-based and score-based approaches.

Several experimental comparisons are lacking in the literature:

• There are no published experimental results comparing the sample complexity of both
approaches.

• There are no published experimental results comparing the quality of the structures
learned by both approaches.

• There are no published experimental results comparing the quality of complete models:
i) models learned by the score-based approach (interleaving structure search and parameter
estimation) versus ii) models learned by the independence-based approach (learning
the structure and then fitting the parameters only once for that structure).



Open problem 6. Adapting recent Bayesian network ideas to Markov networks.
The first independence-based algorithm proposed is GSMN, an adaptation to
Markov networks of the GS algorithm. In the literature there are several recent ideas
for improving the efficiency, quality and sample complexity of GS, such as those discussed
by the authors of the IAMB, MMPC/MB, HITON-PC/MB, Fast-IAMB, PCMB and IPC-MB
algorithms (see Section 3.3.2 for more details). However, all these interesting
ideas were originally developed and tested for learning the structure of Bayesian
networks.

Can adapting these ideas to the Markov network
structure learning problem generate improvements in the area?



Open problem 7. Independence knowledge bases. The argumentative independence
test improves the accuracy of tests significantly when data is scarce. However,
the exact algorithm proposed by this approach has an exponential cost, because
Pearl's axioms are expressed in first-order logic, while the knowledge bases are propositional.
The approximate solution presented is polynomial in running time and still improves
quality, but makes a drastic approximation that does not provide theoretical guarantees.
However, there exist alternative formalisms for reasoning with inconsistent knowledge bases
that work efficiently for first-order logic, such as Markov logic networks.

Can Pearl's axioms be exploited by a formalism alternative to
argumentation?



Open problem 8. Relating independence assertions. Statistical tests are
procedures that run independently of each other, and they are used as a black box by
independence-based algorithms. Each test answers a conditional independence
query using only the input dataset. An implicit assumption made by all
independence-based algorithms is that all the independences queried by the algorithm are mutually
independent given the dataset. This assumption holds only when the data
is sufficiently large for the test to determine the true underlying independence,
because in this case information from other tests is irrelevant. However, when the data is not
sufficient to correctly determine the independence, tests become dependent given the
data, i.e., information from other tests may be useful for avoiding errors. An example
shown in the literature for correcting errors when data is insufficient is the
argumentative independence test, which relates statistical tests through Pearl's axioms, as
additional information for improving the quality of tests when data is not sufficient.

Besides Pearl's axioms, are there other dependence relations
governing independence assertions? As in the case of Pearl's axioms,
can these relations be used as additional information for improving the
quality of independence-based algorithms?
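As a small illustration of how axioms relate test outcomes, the sketch below checks a set of recorded outcomes against two of Pearl's axioms for undirected models: symmetry, and strong union (independence given Z implies independence given any superset of Z). This is a hedged sketch, not the argumentative independence test itself. The choice of these two axioms, the dictionary representation of outcomes, and the function name are assumptions made for the example.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List, Tuple

# (X, Y, Z) stands for the assertion "X is independent of Y given Z".
Assertion = Tuple[str, str, FrozenSet[str]]


def find_inconsistencies(
    outcomes: Dict[Assertion, bool],
) -> List[Tuple[str, Assertion, Assertion]]:
    """Return pairs of test outcomes that jointly violate an axiom.

    `outcomes` maps each assertion to True (test said independent) or
    False (test said dependent).
    """
    conflicts = []
    for (a, ind_a), (b, ind_b) in combinations(outcomes.items(), 2):
        if {a[0], a[1]} != {b[0], b[1]}:
            continue  # the two tests concern different variable pairs
        if a[2] == b[2]:
            # Symmetry: (X, Y | Z) and (Y, X | Z) must agree.
            if ind_a != ind_b:
                conflicts.append(("symmetry", a, b))
        elif a[2] < b[2] and ind_a and not ind_b:
            # Strong union: independent given Z, yet dependent given Z plus W.
            conflicts.append(("strong union", a, b))
        elif b[2] < a[2] and ind_b and not ind_a:
            conflicts.append(("strong union", b, a))
    return conflicts


outcomes = {
    ("A", "B", frozenset()): True,
    ("B", "A", frozenset()): False,            # contradicts symmetry
    ("A", "C", frozenset({"B"})): True,
    ("A", "C", frozenset({"B", "D"})): False,  # contradicts strong union
    ("B", "D", frozenset()): False,
}
for kind, first, second in find_inconsistencies(outcomes):
    print(kind, first, second)
```

A detected conflict only says that at least one of the two outcomes is wrong. Deciding which one to trust needs extra information, such as the reliability of each test, and that is where argumentation or an alternative formalism would come in.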



Open problem 9. Improving the quality of independence-based algorithms.

Most of the open problems listed above (namely, Open problems 1, 2, 4, 6, 7 and 8)
share the same root cause: independence tests are not reliable when data
is scarce. Three general approaches were considered for tackling all these problems:

i) For answering the questions of Open problems 1 and 6: designing new algorithms that
select more reliable tests to execute, to reduce cascading errors.

ii) For answering the questions of Open problems 2 and 8: modeling the problem
as a distribution over structures given the data, to assign probabilities to
structures and avoid cascading errors.

iii) For answering the questions of Open problems 4, 7 and 8: detecting inconsistencies
among independence assertions to correct errors in statistical tests.

Are these three approaches redundant or complementary for improving
the quality of independence-based algorithms? If these approaches are
complementary, is it possible to develop a sound and efficient formalism
that takes advantage of all of them?

6 Conclusions



The present work discussed the most relevant technical aspects of the problem of learning
the structure of Markov networks from data, with emphasis on independence-based algorithms.
In the analysis of this technology, this work surveyed the current state-of-the-art approaches,
discussing their current limitations and a series of open problems where future work may
produce advances in the area. The paper concludes by opening a discussion, in Open
problem 9, about how to develop a general formalism that brings together the answers
to several questions of the previous open problems, in order to improve the quality of the
structures learned when data is scarce.